Smart Paper Notes

Topical: Learning Repository Embeddings from Source Code using Attention

Agathe Lherondelle, Yash Satsangi, Fran Silavong, Shaltiel Eloul, Sean Moran

Category: Artificial Intelligence

2022-08-19

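Machine learning on source code (MLOnCode) promises to transform how software is delivered. By mining the context and relationships between software artefacts, MLOnCode augments software developers' capabilities with code auto-generation, code recommendation, code auto-tagging, and other data-driven enhancements. For many of these tasks a script-level representation of code is sufficient. In many cases, however, a repository-level representation that takes into account the various dependencies and the repository structure is required, for example when auto-tagging repositories with topics or auto-documenting repository code. Existing methods for computing repository-level representations suffer from (a) reliance on natural-language documentation of the code (for example, README files) and (b) naive aggregation of method- or script-level representations, for example by concatenation or averaging. This paper introduces Topical, a deep neural network that generates repository-level embeddings of publicly available GitHub code repositories directly from source code. Topical incorporates an attention mechanism that projects the source code, the full dependency graph, and the script-level textual information into a dense repository-level representation. To compute the repository-level representations, Topical is trained to predict the topics associated with a repository, on a dataset of publicly available GitHub repositories that were crawled together with their ground-truth topic tags. Our experiments show that the embeddings computed by Topical outperform multiple baselines on the task of repository auto-tagging, including baselines that naively combine method-level representations through averaging or concatenation.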
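To make the contrast between naive averaging and attention-based aggregation concrete, here is a minimal sketch in plain Python. It is not Topical's architecture: the toy script embeddings, the fixed `query` vector (standing in for parameters that would be learned), and the function names `mean_pool` and `attention_pool` are all assumptions made for illustration.

```python
import math


def mean_pool(script_embeddings):
    """Naive baseline: average the script-level vectors element-wise."""
    n = len(script_embeddings)
    dim = len(script_embeddings[0])
    return [sum(vec[i] for vec in script_embeddings) / n for i in range(dim)]


def attention_pool(script_embeddings, query):
    """Weight each script vector by softmax(dot(query, vec) / sqrt(dim))."""
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, vec)) / math.sqrt(dim)
        for vec in script_embeddings
    ]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * vec[i] for w, vec in zip(weights, script_embeddings))
        for i in range(dim)
    ]
    return pooled, weights


# Toy data (hypothetical): three script embeddings of dimension 4.
scripts = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
]
query = [2.0, 0.0, 0.0, 0.0]  # stands in for a learned query vector

repo_mean = mean_pool(scripts)
repo_attn, attn_weights = attention_pool(scripts, query)
```

With mean pooling every script contributes equally, whereas the attention weights let scripts that align with the query dominate the repository vector. In a trained model the query and projections would be learned from the topic-prediction objective rather than fixed by hand.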


We present a new pre-trained language model (PLM) for modern Hebrew, termed AlephBERTGimmel, which employs a much larger vocabulary (128K items) than previous standard Hebrew PLMs. We perform a contrastive analysis of this model against all previous Hebrew PLMs (mBERT, heBERT, AlephBERT) and assess the effects of larger vocabularies on task performance. Our experiments show that larger vocabularies lead to fewer splits, and that reducing splits is better for model performance across different tasks. All in all, this new model achieves new SOTA on all available Hebrew benchmarks, including Morphological Segmentation, POS Tagging, Full Morphological Analysis, NER, and Sentiment Analysis. Consequently, we advocate for PLMs that are larger not only in terms of number of layers or training data, but also in terms of their vocabulary. We release the new model publicly for unrestricted use.
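The link between vocabulary size and the number of splits can be illustrated with a toy tokenizer. The sketch below assumes a BERT-style greedy longest-match WordPiece scheme, which the abstract does not specify; the function `wordpiece_split` and both vocabularies are hypothetical, and an English word is used only for readability.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split; non-initial pieces carry a '##' prefix."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches at this position
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical vocabularies: the larger one also stores the whole word.
small_vocab = {"un", "##break", "##able"}
large_vocab = small_vocab | {"unbreakable"}

small_split = wordpiece_split("unbreakable", small_vocab)
large_split = wordpiece_split("unbreakable", large_vocab)
```

Here the small vocabulary breaks the word into three pieces, while the larger vocabulary keeps it whole, which is the kind of reduction in splits the abstract associates with better task performance.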


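We present a new pre-trained language model (PLM) for Rabbinic Hebrew, termed Berel (BERT Embeddings for Rabbinic-Encoded Language). Although other PLMs exist for processing Hebrew texts (e.g., HeBERT, AlephBERT), they are all trained on modern Hebrew texts, which diverge substantially from Rabbinic Hebrew in their lexicographical, morphological, syntactic, and orthographic norms. We demonstrate the superiority of Berel on Rabbinic texts via a challenge set of Hebrew homographs. We release the new model and the homograph challenge set for unrestricted use.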
